Abstract. The goal of constraint-based sequence mining is to find sequences of 
symbols that are included in a large number of input sequences and that satisfy 
some constraints specified by the user. Many constraints have been proposed in 
the literature, but a general framework is still missing. We investigate the use of 
constraint programming as a general framework for this task. 

We first identify four categories of constraints that are applicable to sequence 
mining. We then propose two constraint programming formulations. The first 
formulation introduces a new global constraint called exists-embedding. This 
formulation is the most efficient but does not support one type of constraint. To support 
such constraints, we develop a second formulation that is more general but incurs 
more overhead. Both formulations can use the projected database technique used 
in specialised algorithms. 

Experiments demonstrate the flexibility towards constraint-based settings and 
compare the approach to existing methods. 

Keywords: sequential pattern mining, sequence mining, episode mining, 
constrained pattern mining, constraint programming, declarative programming 


1 Introduction 

In AI in general and in data mining in particular, there is an increasing interest in 
developing general methods for data analysis. In order to be useful, such methods should be 
easy to extend with domain-specific knowledge. 

In pattern mining, the frequent sequence mining problem has already been studied 
in depth, but usually with a focus on efficiency and less on generality and 
extensibility. An important step in the development of more general approaches was the cSpade 
algorithm [19], which supports a variety of constraints, such as constraints on the 
length of the pattern, on the maximum gap in embeddings or on 
the discriminative power of the patterns between datasets. Many other constraints have 
been integrated into specific mining algorithms (e.g. 1161171161131 ). However, none of 
these are truly generic in that adding extra constraints usually amounts to changing the 
data-structures used in the core of the algorithm. 

* This paper is published at CPAIOR 2015; this arXiv version additionally has an appendix. 





For itemset mining, the simplest form of pattern mining, it has been shown that 
constraint programming (CP) can be used as a generic framework for constraint-based 
mining Q and beyond 0141111 . Recent works have also investigated the usage of CP-based 
approaches for mining sequences with explicit wildcards [3,7,8]. A wildcard represents 
the presence of exactly one arbitrary symbol in that position in the sequence. 

The main difference between mining itemsets, sequences with wildcards and 
standard sequences lies in the complexity of testing whether a pattern is included in another 
itemset/sequence, e.g. from the database. For itemsets, this is simply testing the 
subset inclusion relation, which is easy to encode in CP. For sequences with wildcards and 
general sequences, one has to check whether an embedding exists (a matching of the 
individual symbols). When only a few embeddings are possible, as in sequences with 
explicit wildcards, this can be done with a disjunctive constraint over all possible 
embeddings [8]. For general sequences (the setting we address in this paper), a pattern of 
size $m$ can be embedded into a sequence of size $n$ in $O(n^m)$ different ways, hence 
prohibiting a direct encoding or enumeration. 

The contributions of this paper are as follows: 

- We present four categories of user constraints; this categorization will be useful to 
compare the generality of the two proposed models. 

- We introduce an exists-embedding global constraint for sequences, and show the 
relation to projected databases and projected frequency used in the sequence mining 
literature to speedup the mining process 161201 . 

- We propose a more general formulation using a decomposition of the exists-embedding 
constraint. Searching whether an embedding exists for each transaction is not easily 
expressed in CP and requires a modified search procedure. 

- We investigate the effect of adding constraints, and compare our method with 
state-of-the-art sequence mining algorithms. 

The rest of the paper is organized as follows: Section 2 formally introduces the sequence 
mining problem and the constraint categories. Section 3 explains the basics of encoding 
sequence mining in CP. Sections 4 and 5 present the model with the global constraint and 
the decomposition respectively. Section 6 presents the experiments. After an overview 
of related work (Section 7), we discuss the proposed approach and results in Section 8. 

2 Sequence mining 

Sequence mining HI can be seen as a variation of the well-known itemset mining prob¬ 
lem proposed in El. In itemset mining, one is given a set of transactions, where each 
transaction is a set of items, and the goal is to find patterns (i.e. sets of items) that are 
included in a large number of transactions. In sequence mining, the problem is similar 
except that both transactions and patterns are ordered (i.e. they are sequences instead 
of sets) and symbols can be repeated. For example, (b, a, c, b) and (a, c, c, b, b) are two 
sequences, and the sequence (a, b) is one possible pattern included in both. 

This problem is known in the literature under multiple names, such as embedded 
subsequence mining, sequential pattern mining, flexible motif mining, or serial episode 
mining depending on the application. 









2.1 Frequent sequence mining: problem statement 

A key concept of any pattern mining setting is the pattern inclusion relation. In 
sequence mining, a pattern is included in a transaction if there exists an embedding of 
that sequence in the transaction, where an embedding is a mapping of every symbol in 
the pattern to the same symbol in the transaction such that the order is respected. 

Definition 1 (Embedding in a sequence). Let $S = (s_1, \dots, s_m)$ and $S' = (s'_1, \dots, s'_n)$ 
be two sequences of size $m$ and $n$ respectively with $m \le n$. The tuple of integers 
$e = (e_1, \dots, e_m)$ is an embedding of $S$ in $S'$ (denoted $S \sqsubseteq_e S'$) if and only if: 

$S \sqsubseteq_e S' \leftrightarrow e_1 < \dots < e_m$ and $\forall i \in 1, \dots, m : s_i = s'_{e_i}$. (1) 

For example, let $S = (a, b)$ be a pattern, then $(2, 4)$ is an embedding of $S$ in $(b, a, c, b)$ 
and $(1, 4)$, $(1, 5)$ are both embeddings of $S$ in $(a, c, c, b, b)$. An alternative setting 
considers sequences of itemsets instead of sequences of individual symbols. In this case, 
the definition is $S \sqsubseteq_e S' \leftrightarrow e_1 < \dots < e_m$ and $\forall i \in 1, \dots, m : s_i \subseteq s'_{e_i}$. We do not 
consider this setting further in this paper, though it is an obvious extension. 

We can now define the sequence inclusion relation as follows: 

Definition 2 (Inclusion relation for sequences). Given two sequences $S$ and $S'$, $S$ is 
included in $S'$ (denoted $S \sqsubseteq S'$) if there exists an embedding $e$ of $S$ in $S'$: 

$S \sqsubseteq S' \leftrightarrow \exists e$ s.t. $S \sqsubseteq_e S'$. (2) 

To continue on the example above, S = (a, b) is included in both (b, a, c, b) and 
(a, c, c, b, b) but not in (c, b, a, a). 
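
As a purely illustrative sketch of Definitions 1 and 2 (not the implementation used in this paper), the following Python snippet enumerates embeddings by brute force. The function names `embeddings` and `is_included` are hypothetical, and positions are 1-based to match the examples above.

```python
from itertools import combinations


def embeddings(pattern, transaction):
    """Return all embeddings of `pattern` in `transaction` as 1-based index tuples."""
    m, n = len(pattern), len(transaction)
    result = []
    # combinations() yields strictly increasing tuples, i.e. e_1 < ... < e_m.
    for positions in combinations(range(1, n + 1), m):
        if all(transaction[e - 1] == s for s, e in zip(pattern, positions)):
            result.append(positions)
    return result


def is_included(pattern, transaction):
    """Inclusion holds as soon as at least one embedding exists (Definition 2)."""
    return bool(embeddings(pattern, transaction))
```

Enumerating every index tuple is exponential in the pattern size, so this is only meant to make the definitions concrete on small examples.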

Definition 3 (Sequential dataset). Given an alphabet of symbols $\Sigma$, a sequential dataset 
$D$ is a multiset of sequences defined over symbols in $\Sigma$. 

Each sequence in $D$ is called a transaction using the terminology from itemset mining. 
The number of transactions in $D$ is denoted $|D|$ and the sum of the lengths of every 
transaction in $D$ is denoted $\|D\|$ ($\|D\| = \sum_{i=1}^{|D|} |T_i|$). Furthermore, we use dataset as 
a shorthand for sequential dataset when it is clear from context. 

Given a dataset $D = \{T_1, \dots, T_n\}$, one can compute the cover of a sequence $S$ as 
the set of all transactions $T_i$ that contain $S$: 

$cover(S, D) = \{T_i \in D \mid S \sqsubseteq T_i\}$ (3) 

We can now define frequent sequence mining, where the goal is to find all patterns 
that are frequent in the database; namely, the size of their cover is sufficiently large. 

Definition 4 (Frequent sequence mining). Given: 

1. an alphabet $\Sigma$ 

2. a sequential dataset $D = \{T_1, \dots, T_n\}$ defined over $\Sigma$ 

3. a minimum frequency threshold $\theta$, 

enumerate all sequences $S$ such that $|cover(S, D)| \ge \theta$. 

In large datasets, the number of frequent sequences is often too large to be 
analyzed by a human. Extra constraints can be added to extract fewer, but more relevant or 
interesting patterns. Many such constraints have been studied in the past. 
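
To make the problem statement concrete, here is a minimal sketch of the cover computation and of a naive pattern-growth enumeration of frequent sequences. It is an illustration under simplifying assumptions (hypothetical function names, transactions given as strings or tuples, covers returned as transaction indices) and is not one of the algorithms evaluated in this paper.

```python
def is_subsequence(pattern, transaction):
    """True if `pattern` can be embedded in `transaction` (greedy left-to-right scan)."""
    remaining = iter(transaction)
    # `symbol in remaining` advances the iterator up to the first match.
    return all(symbol in remaining for symbol in pattern)


def cover(pattern, dataset):
    """Indices of the transactions of `dataset` that include `pattern`."""
    return [i for i, t in enumerate(dataset) if is_subsequence(pattern, t)]


def frequent_sequences(dataset, theta):
    """Enumerate all non-empty sequences whose cover has size at least `theta`."""
    if theta < 1:
        raise ValueError("theta must be at least 1, otherwise the search never ends")
    alphabet = sorted({s for t in dataset for s in t})
    found, stack = [], [()]
    while stack:
        prefix = stack.pop()
        for symbol in alphabet:
            candidate = prefix + (symbol,)
            # Frequency is anti-monotone: only frequent prefixes need to be extended.
            if len(cover(candidate, dataset)) >= theta:
                found.append(candidate)
                stack.append(candidate)
    return found
```

For instance, `frequent_sequences(["bacb", "accbb", "cbaa"], 3)` explores only the extensions of patterns that are themselves included in all three transactions.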



2.2 Constraints 


Constraints typically capture background knowledge and are provided by the user. We 
identify four categories of constraints for sequence mining: 1) constraints over the 
pattern, 2) constraints over the cover set, 3) constraints over the inclusion relation and 4) 
preferences over the solution set. 

Constraints on the pattern. These put restrictions on the structure of the pattern. 
Typical examples include size constraints or regular expression constraints. 

Size constraints: A size constraint is simply $|S| \bowtie \alpha$ where $\bowtie \in \{=, >, \ge, <, \le\}$ 
and $\alpha$ is a user-supplied threshold. It is used, for example, to discard small patterns. 

Item constraints: One can constrain a symbol $t$ to surely be in the pattern: $\exists s \in S : 
s = t$; or that it can not appear in the pattern: $\forall s \in S : s \neq t$, or more complex logical 
expressions over the symbols in the pattern. 

Regular expression constraints: Let $R$ be a regular expression over the vocabulary $V$ 
and $L_R$ be the language of sequences recognised by $R$, then for any sequence pattern $S$ 
over $V$, the match-regular constraint requires that $S \in L_R$ ii. 

Constraints on the cover set. The minimum frequency constraint $|cover(S, D)| \ge \theta$ 
is the most common example of a constraint over the cover set. Alternatively, one can 
impose the maximum frequency constraint: $|cover(S, D)| \le \beta$. 

Discriminating constraints: In case of multiple datasets, discriminating constraints 
require that patterns effectively distinguish the datasets from each other. Given two datasets 
$D_1$ and $D_2$, one can require that the ratio between the size of the cover of both is above 
a threshold: $\frac{|cover(S, D_1)|}{|cover(S, D_2)|} \ge \alpha$. Other examples include more statistical measures such 
as information gain and entropy ca. 

Constraints over the inclusion relation. The inclusion relation in Definition 2 states 
that $S \sqsubseteq S' \leftrightarrow \exists e$ s.t. $S \sqsubseteq_e S'$. Hence, an embedding of a pattern can match symbols 
that are far apart in the transaction. For example, the sequence (a, c) is embedded in 
the transaction (a, b, b, b, ..., b, c) independently of the distance between a and c in 
the transaction. This is undesirable when mining datasets with long transactions. The 
max-gap and max-span constraints [19] impose a restriction on the embedding, and 
hence on the inclusion relation. The max-gap constraint is satisfied on a transaction $T_i$ 
if an embedding $e$ maps every two consecutive symbols in $S$ to symbols in $T_i$ that are 
close to each other: $\text{max-gap}_i(e) \Leftrightarrow \forall j \in 2..|T_i|, (e_j - e_{j-1} - 1) \le \gamma$. For example, 
the sequence (abc) is embedded in the transaction (adddbc) with a maximum gap of 
3 whereas (ac) is not. The max-span constraint requires that the distance between the 
first and last position of the embedding in each transaction $T_i$ is below a threshold $\gamma$: 
$\text{max-span}_i(e) \Leftrightarrow e_{|T_i|} - e_1 + 1 \le \gamma$. 
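
The max-gap and max-span conditions can be checked on small examples with the brute-force sketch below. It is only an illustration: the function name and its keyword arguments are hypothetical, and the embedding is indexed over the actual pattern length rather than over a padded array.

```python
from itertools import combinations


def constrained_embedding_exists(pattern, transaction, max_gap=None, max_span=None):
    """True if some embedding of `pattern` satisfies the given gap and span limits."""
    m = len(pattern)
    for e in combinations(range(1, len(transaction) + 1), m):
        if any(transaction[p - 1] != s for s, p in zip(pattern, e)):
            continue  # not an embedding at all
        # max-gap: e_j - e_{j-1} - 1 <= max_gap for consecutive positions.
        gaps_ok = max_gap is None or all(
            e[j] - e[j - 1] - 1 <= max_gap for j in range(1, m)
        )
        # max-span: last position - first position + 1 <= max_span.
        span_ok = max_span is None or m == 0 or e[-1] - e[0] + 1 <= max_span
        if gaps_ok and span_ok:
            return True
    return False
```

With `max_gap=3`, the pattern (abc) is found in (adddbc) while (ac) is not, matching the example in the text.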

Preferences over the solution set. A pairwise preference over the solution set 
expresses that a pattern A is preferred over a pattern B. In [11] it was shown that 
condensed representations like closed, maximal and free patterns can be expressed as 
pairwise preference relations. Skypatterns [17] and multi-objective optimisation can also be 
seen as preference over patterns. As an example, let $\mathcal{A}$ be the set of all patterns; then, the 
set of all closed patterns is $\{S \in \mathcal{A} \mid \nexists S' \text{ s.t. } S \sqsubset S' \text{ and } cover(S, D) = cover(S', D)\}$. 
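
The closedness condition can be illustrated with a small post-processing filter. This is a sketch under the assumption that a finite candidate set of patterns is given explicitly; the names `closed_patterns` and `is_subsequence` are hypothetical, and the paper enforces such preferences during search rather than afterwards.

```python
def is_subsequence(pattern, other):
    """True if `pattern` can be embedded in `other`."""
    remaining = iter(other)
    return all(symbol in remaining for symbol in pattern)


def closed_patterns(patterns, dataset):
    """Keep the patterns that have no strict super-pattern with the same cover."""
    covers = {
        p: frozenset(i for i, t in enumerate(dataset) if is_subsequence(p, t))
        for p in patterns
    }
    return [
        p
        for p in patterns
        if not any(
            p != q and is_subsequence(p, q) and covers[p] == covers[q]
            for q in patterns
        )
    ]
```

The result is closed only relative to the candidate set that is passed in, so that set should contain every pattern of interest (for example, all frequent ones).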
Fig. 1. Example assignment; blue boxes represent variables, white boxes represent data. 


3 Sequence Mining in Constraint Programming 

In constraint programming, problems are expressed as a constraint satisfaction problem 
(CSP), or a constraint optimisation problem (COP). A CSP $X = (V, D, C)$ consists of 
a set of variables $V$, a finite domain $D$ that defines for each variable $v \in V$ the possible 
values that it can take, and a set of constraints $C$ over the variables in $V$. A solution 
to a CSP is an assignment of each variable to a value from its domain such that all 
constraints are satisfied. A COP additionally consists of an optimisation criterion $f(V)$ 
that expresses the quality of the solution. 

There is no restriction on what a constraint $C$ can represent. Examples include 
logical constraints like $X \wedge Y$ or $X \rightarrow Y$ and mathematical constraints such as $Z = X + Y$. 
Each constraint has a corresponding propagator that ensures the constraint is 
satisfied during the search. Many global constraints have been proposed, such as 
alldifferent, which have a custom propagator that is often more efficient than if one would 
decompose that constraint in terms of simple logical or mathematical constraints. A 
final important concept used in this paper is that of reified constraints. A reified 
constraint is of the form $B \leftrightarrow C'$ where $B$ is a Boolean variable which will be assigned to 
the truth value of constraint $C'$. Reified constraints have their own propagator too. 
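
The notions of variables, domains, constraints and reification can be illustrated with a tiny brute-force solver. This is a sketch only: a real CP solver interleaves propagation and search instead of enumerating all assignments, and the variable names and the function `solve_csp` below are hypothetical.

```python
from itertools import product


def solve_csp(domains, constraints):
    """Yield every assignment that satisfies all constraints (no propagation)."""
    names = list(domains)
    for values in product(*(domains[v] for v in names)):
        assignment = dict(zip(names, values))
        if all(check(assignment) for check in constraints):
            yield assignment


domains = {"X": [0, 1, 2], "Y": [0, 1, 2], "Z": [0, 1, 2, 3, 4], "B": [False, True]}
constraints = [
    lambda a: a["Z"] == a["X"] + a["Y"],    # mathematical constraint Z = X + Y
    lambda a: a["B"] == (a["X"] < a["Y"]),  # reified constraint B <-> (X < Y)
    lambda a: a["Z"] == 3,
]
solutions = list(solve_csp(domains, constraints))
```

Here `B` is not constrained directly; it simply takes the truth value of `X < Y` in each solution, which is what reification means.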

Variables and domains for modeling sequence mining. Modeling a problem as a CSP 
requires the definition of a set of variables with a finite domain, and a set of constraints. 
One solution to the CSP will correspond to one pattern, that is, one frequent sequence. 

We model the problem using an array $S$ of integer variables representing the 
characters of the sequence and an array $C$ of Boolean variables representing which 
transactions include the pattern. This is illustrated in Fig. 1. 

1. $T_1$ and $T_2$ represent two transactions given as input. We denote the number of 
transactions by $n$; 

2. The array of variables $S$ represents the sequence pattern. Each variable $S_j$ 
represents the character in the $j$th position of the sequence. The size of $S$ is determined 
by the length of the longest transaction (in the example this is 4). We want to allow 
patterns that have fewer than $\max_i(|T_i|)$ characters, hence we use $\epsilon$ to represent an 
unused position in $S$. The domain of each variable $S_j$ is thus $\Sigma \cup \{\epsilon\}$; 

3. Boolean variables $C_i$ represent whether the pattern is included in transaction $T_i$, 
that is, whether $S \sqsubseteq T_i$. In the example, this is the case for $T_1$ but not for $T_2$. 

What remains to be defined are the constraints. The key part here is how to model the 
inclusion relation; that is, the constraint that verifies whether a pattern is included in the 
transaction. Conceptually, this is the following reified constraint: $C_i \leftrightarrow \exists e$ s.t. $S \sqsubseteq_e T_i$. 
As mentioned in the introduction, the number of possible embeddings is exponential in 
the size of the pattern. Hence, one can not model this as a disjunctive constraint over all 
possible embeddings (as is done for sequences with explicit wildcards [8]). 

We propose two approaches to cope with this problem: one with a global constraint 
that verifies the inclusion relation directly on the data, and one in which the inclusion 
relation is decomposed and the embedding is exposed through variables. 

4 Sequence mining with a global exists-embedding constraint 

The model consists of three parts: encoding of the pattern, of the minimum frequency 
constraint and finally of the inclusion relation using a global constraint. 

Variable-length pattern: The array $S$ has length $k$; patterns with $l < k$ symbols are 
represented with $l$ symbols from $\Sigma$ and $(k - l)$ times an $\epsilon$ value. To avoid enumerating 
the same pattern with $\epsilon$ values in different positions, $\epsilon$ values can only appear at the end: 

$\forall j \in 1..(k - 1) : S_j = \epsilon \rightarrow S_{j+1} = \epsilon$ (4) 

Minimum frequency: At least $\theta$ transactions should include the pattern. This inclusion 
is indicated by the array of Boolean variables $C$: 

$\sum_{i=1}^{n} C_i \ge \theta$ (5) 

Global exists-embedding constraint: The goal is to encode the relation $C_i \leftrightarrow \exists e$ s.t. $S \sqsubseteq_e T_i$. 
The propagator algorithm for this constraint is given in Algorithm 1. It is an 
incremental propagator that should be run when one of the $S$ variables is assigned. Line 1 
will loop over the variables in $S$ until reaching an unassigned one at position $pos_s$. In 
the sequence mining literature, the sequence $(S_1..S_{pos_s})$ is called the prefix. For each 
assigned $S_j$ variable, a matching element in the transaction is sought, starting from the 
position $pos_e$ after the element that matched the previous assigned variable $S_{j-1}$. If no 
such match is found then an embedding can not be found and $C_i$ is set to false. 

Lines 21 and 22 propagate the remaining possible symbols from $T_i$ to the first unassigned $S$ variable 
in case $C_i = True$. 

The propagator algorithm has complexity $O(|T_i|)$: the loop on line 1 is run up to 
$|T_i|$ times and the loop on line 3 at most $|T_i|$ times in total, as $pos_e$ is monotonically increasing. 
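
The core of this propagator is a single left-to-right scan of the transaction. The sketch below is a simplified, non-incremental Python rendering of that scan under stated assumptions: it recomputes everything from the start of the prefix, uses 0-based positions, represents the unused-position value by `None`, and uses hypothetical function names. It is not the Gecode propagator itself.

```python
EPSILON = None  # stands for the unused-position value


def check_prefix(prefix, transaction):
    """Greedy scan: return (included, pos_e).

    `included` is False when the prefix has no embedding in `transaction`;
    `pos_e` is the 0-based index just after the leftmost match of the prefix.
    """
    pos_e = 0
    for symbol in prefix:
        if symbol is EPSILON:
            break  # only padding follows, everything before it matched
        while pos_e < len(transaction) and transaction[pos_e] != symbol:
            pos_e += 1
        if pos_e == len(transaction):
            return False, pos_e
        pos_e += 1  # match found, continue after it
    return True, pos_e


def allowed_next_symbols(prefix, transaction):
    """Symbols that can still extend the prefix in this transaction, plus EPSILON."""
    included, pos_e = check_prefix(prefix, transaction)
    if not included:
        return set()
    return set(transaction[pos_e:]) | {EPSILON}
```

Because `pos_e` only ever increases, the scan touches each transaction position at most once, which is the source of the linear complexity stated above.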

Algorithm 1 Incremental propagator for $C_i \leftrightarrow \exists e$ s.t. $S \sqsubseteq_e T_i$ 
internal state, $pos_s$: current position in $S$ to check, initially 1 
internal state, $pos_e$: current position in $T_i$ to match to, initially 1 
1: while $pos_s \le |T_i|$ and $S[pos_s]$ is assigned do   ▷ note that $|T_i| \le |S|$ 
2:     if $S[pos_s] \neq \epsilon$ then 
3:         while $pos_e \le |T_i|$ and not ($T_i[pos_e] = S[pos_s]$) do   ▷ find match 
4:             $pos_e \leftarrow pos_e + 1$ 
5:         end while 
6:         if $pos_e \le |T_i|$ then   ▷ match found, on to next one 
7:             $pos_s \leftarrow pos_s + 1$; $pos_e \leftarrow pos_e + 1$ 
8:         else 
9:             propagate $C_i = False$ and return 
10:     else   ▷ previous ones matched and rest is $\epsilon$ 
11:         propagate $C_i = True$ and return 
12: end while 
13: if $pos_s > |S|$ then   ▷ previous ones matched and reached end of sequence 
14:     propagate $C_i = True$ and return 
15: if $pos_s > |T_i|$ and $|T_i| < |S|$ then 
16:     let $R \leftarrow S[|T_i| + 1]$ 
17:     if $R$ is assigned and $R = \epsilon$ then   ▷ $S$ should not be longer than this transaction 
18:         propagate $C_i = True$ and return 
19:     if $\epsilon$ is not in the domain of $R$ then 
20:         propagate $C_i = False$ and return 
21: if $C_i$ is assigned and $C_i = True$ then 
22:     propagate by removing from $S[pos_s]$ all symbols not in $(T_i[pos_e]..T_i[|T_i|])$ except $\epsilon$ 

4.1 Improved pruning with projected frequency 

Compared to specialised sequence mining algorithms, $pos_s$ in Algorithm 1 points to 
the first position in $S$ after the current prefix. Dually, $pos_e$ points to the position after 
the first match of the prefix in the transaction. If one would project the prefix away, 
only the symbols in the transaction from $pos_e$ on would remain; this is known as prefix 
projection ii. Given prefix (a, c) and transaction (b, a, a, e, c, b, c, b, b) the projected 
transaction is (b, c, b, b). 

The concept of a prefix-projected database can be used to recompute the frequency 
of all symbols in the projected database. If a symbol is present but not frequent in 
the projected database, one can avoid searching over it. This is known to speed up 
specialised mining algorithms considerably 061161 . 

To achieve this in the above model, we need to adapt the global propagator so that 
it exports the symbols that still appear after $pos_e$. We introduce an auxiliary integer 
variable $X_i$ for every transaction $T_i$, whose domain represents these symbols (the set 
of symbols is monotonically decreasing). To avoid searching over infrequent symbols, 
we define a custom search routine (brancher) over the $S$ variables. It first computes the 
local frequencies of all symbols based on the domains of the $X_i$ variables; symbols that 
are locally infrequent will not be branched over. See Appendix A for more details. 
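
The idea of prefix projection and local frequencies can be sketched outside of any solver as follows. This is an illustration only: the helper names are hypothetical, transactions are plain strings or tuples, and the actual model obtains this information from the domains of the auxiliary variables rather than by rescanning the data.

```python
from collections import Counter


def project(prefix, transaction):
    """Return the part of `transaction` after the leftmost match of `prefix`,
    or None when the prefix is not included in the transaction."""
    pos = 0
    for symbol in prefix:
        try:
            pos = transaction.index(symbol, pos) + 1
        except ValueError:
            return None
    return transaction[pos:]


def locally_frequent_symbols(prefix, dataset, theta):
    """Symbols occurring in at least `theta` projected transactions."""
    counts = Counter()
    for transaction in dataset:
        suffix = project(prefix, transaction)
        if suffix is not None:
            counts.update(set(suffix))  # count each symbol once per transaction
    return {symbol for symbol, count in counts.items() if count >= theta}
```

Only the symbols returned by `locally_frequent_symbols` are worth branching over when extending the prefix, since any other extension cannot reach the frequency threshold.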

4.2 Constraints 

This formulation supports a variety of constraints, namely on the pattern (type 1), on the 
cover set (type 2) and over the solution set (type 4). For example, the type 1 constraint 
min-size constrains the size of the pattern to be larger than a user-defined threshold $\alpha$. 
This constraint can be formalised as follows. 

$\sum_{j=1}^{k} [S_j \neq \epsilon] \ge \alpha$ (6) 

Minimum frequency in Equation (5) is an example of a constraint of type 2, over the 
cover set. Another example is the discriminative constraint mentioned in Section 2.2: 
given two datasets $D_1$ and $D_2$, one can require that the ratio between the cover in the 
two datasets is larger than a user defined threshold $\alpha$: $\frac{|cover(S, D_1)|}{|cover(S, D_2)|} \ge \alpha$. Let $D = 
D_1 \cup D_2$ and let $t_1 = \{i \mid T_i \in D_1\}$ and $t_2 = \{i \mid T_i \in D_2\}$, then we can extract the 
discriminant patterns from $D$ by applying the following constraint. 

$\frac{\sum_{i \in t_1} C_i}{\sum_{i \in t_2} C_i} \ge \alpha$ (7) 


Such a constraint can also be used as an optimisation criterion in a CP framework. 

Type 4 constraints, a.k.a. preference relations, have been proposed in [11] to 
formalise well-known pattern mining settings such as maximal or closed patterns. Such 
preference relations can be enforced dynamically during search for any CP 
formulation CD. The preference relation for closed is $S' \succ S \Leftrightarrow S \sqsubset S' \wedge cover(S, D) = 
cover(S', D)$ and one can reuse the global reified exists-embedding constraint for this. 

Finally, type 3 constraints over the inclusion relation are not possible in this model. 
Indeed, a new global constraint would have to be created for every possible 
(combination of) type 3 constraints. For example for max-gap, one would have to modify 
Algorithm 1 to check whether the gap is smaller than the threshold, and if not, to search for 
an alternative embedding instead (thereby changing the complexity of the algorithm). 


5 Decomposition with explicit embedding variables 

In the previous model, we used a global constraint to assign the $C_i$ variables to their 
appropriate value, that is: $C_i \leftrightarrow \exists e$ s.t. $S \sqsubseteq_e T_i$. The global constraint efficiently tests 
the existence of one embedding, but does not expose the value of this embedding, thus 
it is impossible to express constraints over embeddings such as the max-gap constraint. 

To address this limitation, we extend the previous model with a set of embedding 
variables $E_{i1}, \dots, E_{i|T_i|}$ that will represent an embedding $e = (e_1, \dots, e_{|T_i|})$ of 
sequence $S$ in transaction $T_i$. In case there is no possible match for a character $S_j$ in $T_i$, 
the corresponding $E_{ij}$ variable will be assigned a no-match value. 

5.1 Variables and constraints 

Embedding variables. For each transaction $T_i$ of length $|T_i|$, we introduce integer 
variables $E_{i1}, \dots, E_{i|T_i|}$. Each variable $E_{ij}$ is an index in $T_i$, and an assignment to 
$E_{ij}$ maps the variable $S_j$ to a position in $T_i$; see Figure 2, where the value of the index is 
materialized by the red arrows. The domain of $E_{ij}$ is initialized to all possible positions 
of $T_i$, namely $1, \dots, |T_i|$, plus a no-match entry which we represent by the value $|T_i| + 1$. 






Fig. 2. Example assignment; blue boxes represent variables, white boxes represent data. The 
cursive values in $E_1$ and $E_2$ represent the no-match value for that transaction. 


The position-match constraint. This constraint ensures that the variables $E_i$ either 
represent an embedding $e$ such that $S \sqsubseteq_e T_i$ or otherwise at least one $E_{ij}$ has the 
no-match value. Hence, each variable $E_{ij}$ is assigned the value $x$ only if the character in 
$S_j$ is equal to the character at position $x$ in $T_i$. In addition, the constraint also ensures 
that the values of two consecutive variables $E_{ij}$, $E_{i(j+1)}$ are increasing so that the 
order of the characters in the sequence is preserved in the transaction. If there exists no 
possible match satisfying these constraints, the no-match value is assigned. 

$\forall i \in 1, \dots, n, \forall j \in 1, \dots, |T_i| : (S_j = T_i[E_{ij}]) \vee (E_{ij} = |T_i| + 1)$ (8) 

$\forall i \in 1, \dots, n, \forall j \in 2, \dots, |T_i| : (E_{i(j-1)} < E_{ij}) \vee (E_{ij} = |T_i| + 1)$ (9) 

Here $S_j = T_i[E_{ij}]$ means that the symbol of $S_j$ equals the symbol at index $E_{ij}$ in 
transaction $T_i$. See Appendix B for an effective reformulation of these constraints. 

Is-embedding constraint. Finally, this constraint ensures that a variable $C_i$ is true 
if the embedding variables $E_{i1}, \dots, E_{i|T_i|}$ together form a valid embedding of sequence 
$S$ in transaction $T_i$. More precisely: if each character $S_j \neq \epsilon$ is mapped to a position in 
the transaction that is different from the no-match value. 

$\forall i \in 1, \dots, n : C_i \leftrightarrow \forall j \in 1, \dots, |T_i| : (S_j \neq \epsilon) \rightarrow (E_{ij} \neq |T_i| + 1)$ (10) 

Note that depending on how the $E_{ij}$ variables will be searched over, the above 
constraints are or are not equivalent to enforcing $C_i \leftrightarrow \exists e$ s.t. $S \sqsubseteq_e T_i$. This is explained 
in the following section. 
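
To see how the position-match and is-embedding constraints interact, the sketch below checks them on one fully assigned transaction. It is an illustration with hypothetical names: `S` is the padded pattern (with `None` for the unused-position value), `T` is one transaction, and `E` holds 1-based positions with `len(T) + 1` as the no-match value.

```python
EPS = None  # unused-position value in the pattern array


def position_match_ok(S, T, E):
    """Check constraints (8) and (9) for a single transaction."""
    no_match = len(T) + 1
    for j in range(len(E)):
        if E[j] == no_match:
            continue  # both disjunctions are satisfied by the no-match value
        if S[j] != T[E[j] - 1]:
            return False  # (8): symbol at the chosen position must equal S_j
        if j > 0 and not E[j - 1] < E[j]:
            return False  # (9): positions must be strictly increasing
    return True


def is_embedding(S, T, E):
    """Right-hand side of constraint (10): every real symbol has a real position."""
    no_match = len(T) + 1
    return all(S[j] is EPS or E[j] != no_match for j in range(len(E)))
```

For the pattern (a, b) padded to length 4, the assignment (2, 4, 5, 5) on (b, a, c, b) is a valid embedding, whereas on (c, b, a, a) the symbol b can only receive the no-match value.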

5.2 Search strategies for checking the existence of embeddings 

CP’s standard enumerative search would search for all satisfying assignments to the 
$S_j$, $C_i$ and $E_{ij}$ variables. As for each sequence of size $m$, the number of embeddings in 
a transaction of size $n$ can be $O(n^m)$, such a search would not perform well. Instead, 
we only need to search whether one embedding exists for each transaction. 

With additional constraints on $E_{ij}$ but not $C_i$. When there are additional constraints 
on the $E_{ij}$ variables such as max-gap, one has to perform backtracking search to find 
a valid embedding. We do this after the $S$ variables have been assigned. 























We call the search over the $S$ variables the normal search, and the search over the 
$E_{ij}$ variables the sub search. Observe that one can do the sub search for each transaction 
$i$ independently of the other transactions as the different $E_i$ have no influence on each 
other, only on $C_i$. Hence, one does not need to backtrack across different sub searches. 

The goal of a sub search for transaction $i$ is to find a valid embedding for that 
transaction. Hence, that sub search should search for an assignment to the $E_{ij}$ variables 
with $C_i$ set to true first. If a valid assignment is found, an embedding for $T_i$ exists and 
the sub search can stop. If no assignment is found, $C_i$ is set to false and the sub search 
can stop too. See Appendix C for more details on the sub search implementation. 
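
The sub search for one transaction can be pictured as a depth-first search that stops at the first valid embedding. The sketch below is a standalone illustration with a hypothetical function name and an optional max-gap limit; the actual sub search is carried out by the solver over the embedding variables.

```python
def find_embedding(pattern, transaction, max_gap=None):
    """Return one embedding (1-based positions) or None if none exists."""

    def extend(j, start, partial):
        if j == len(pattern):
            return tuple(partial)
        stop = len(transaction)
        if max_gap is not None and partial:
            # e_j - e_{j-1} - 1 <= max_gap bounds the next position from above.
            stop = min(stop, partial[-1] + max_gap + 1)
        for pos in range(start, stop + 1):
            if transaction[pos - 1] == pattern[j]:
                found = extend(j + 1, pos + 1, partial + [pos])
                if found is not None:
                    return found  # stop at the first valid embedding
        return None  # backtrack: no position works for symbol j

    return extend(0, 1, [])
```

With a max-gap of 0, the leftmost match of (a, b) in (a, b, a, b, c) cannot be extended to c, so the search backtracks and returns (3, 4, 5); the inclusion variable for the transaction would then be true exactly when the result is not `None`.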

With arbitrary constraints. The constraint formulation in Equation (10) is not 
equivalent to $C_i \leftrightarrow \exists e$ s.t. $S \sqsubseteq_e T_i$. For example, suppose some arbitrary constraint 
propagates $C_i$ to false. For the latter constraint, this would mean that it will enforce that $S$ 
is such that there does not exist an embedding of it in $T_i$. In contrast, the constraint 
in Equation (10) will propagate some $E_{ij}$ to the no-match value, even if there exists a 
valid match for the respective $S_j$ in $T_i$! 

To avoid an $E_{ij}$ being set to the no-match value because of an assignment to $C_i$, 
we can replace Equation (10) by the half-reified $\forall i : C_i \rightarrow (\forall j\ (S_j \neq \epsilon) \rightarrow (E_{ij} \neq 
|T_i| + 1))$ during normal search. 

The sub search then has to search for a valid embedding, even if $C_i$ is set to false 
by some other constraint. One can do this in the sub search of a specific transaction $i$ by 
replacing the respective half-reified constraint by the constraint $C'_i \leftrightarrow (\forall j\ (S_j \neq \epsilon) \rightarrow 
(E_{ij} \neq |T_i| + 1))$ over a new variable $C'_i$ that is local to this sub search. The sub search 
can then proceed as described above, by setting $C'_i$ to true and searching for a valid 
assignment to $E_i$. Consistency between $C'_i$ and the original $C_i$ must only be checked 
after the sub search for transaction $i$ is finished. This guarantees that for any solution 
found, if $C_i$ is false and so is $C'_i$ then indeed, there exists no embedding of $S$ in $T_i$. 

5.3 Projected frequency 

Each $E_{ij}$ variable represents the positions in $T_i$ that $S_j$ can still take. This is more 
general than the projected transaction, as it also applies when the previous symbol in 
the sequence $S_{j-1}$ is not assigned yet. Thus, we can also use the $E_{ij}$ variables to require 
that every symbol of $S_j$ must be frequent in the (generalised) projected database. This 
is achieved as follows. 

$\forall j \in 1 \dots n, \forall x \in \Sigma : S_j = x \rightarrow |\{i : C_i \wedge T_i[E_{ij}] = x\}| \ge \theta$ (11) 

See Appendix D for a more effective reformulation. 


5.4 Constraints 


All constraints from Section 4.2 are supported in this model too. Additionally, 
constraints over the inclusion relation are also supported; for example, max-gap and 
max-span. Recall from Section 2.2 that for an embedding $e = (e_1, \dots, e_{|T_i|})$, we have 
$\text{max-gap}_i(e) \Leftrightarrow \forall j \in 2 \dots |T_i|, (e_j - e_{j-1} - 1) \le \gamma$. One can constrain all the 
embeddings to satisfy the max-gap constraint as follows (note how $x$ is smaller than the 
no-match value $|T_i| + 1$): 

$\forall i \in 1 \dots n, \forall j \in 2 \dots |T_i|, x \in 1 \dots |T_i| : E_{ij} = x \rightarrow x - E_{i(j-1)} \le \gamma + 1$ (12) 

Max-span was formalized as $\text{max-span}_i(e) \Leftrightarrow e_{|T_i|} - e_1 + 1 \le \gamma$ and can be 
formulated as a constraint as follows: 

$\forall i \in 1 \dots n, \forall j \in 2 \dots |T_i|, x \in 1 \dots |T_i| : E_{ij} = x \rightarrow x - E_{i1} \le \gamma - 1$ (13) 

In practice, we implemented a simple difference-except-no-match constraint that achieves 
the same without having to post a constraint for each $x$ separately. 


6 Experiments 

The goal of these experiments is to answer the following four questions: Q1: What is 
the overhead of exposing the embedding variables in the decomposed model? Q2: What 
is the impact of using projected frequency in our models? Q3: What is the impact of 
adding constraints on runtime and on number of results? Q4: How does our approach 
compare to existing methods? 

Algorithm and execution environment: All the models described in this paper have been 
implemented in the Gecode solver¹. We compare our global and decomposed models 
(Section 4 and Section 5) to the state-of-the-art algorithms cSpade [19] and 
PrefixSpan El. We use the author’s cSpade implementation² and a publicly available 
PrefixSpan implementation by Y. Tabei³. We also compare our models to the CP-based 
approach proposed by Da. No implementation of this is available so we reimplemented 
it in Gecode. Gecode does not support non-deterministic automata so we use a more 
compact DFA encoding that requires only $O(n \cdot |\Sigma|)$ transitions, by constructing it 
back-to-front. We call this approach regular-dfa. Unlike the non-deterministic version, 
this does not allow the addition of constraints of type 3 such as max-gap. 

All algorithms were run on a Linux PC with 16 GB of memory. Algorithm runs 
taking more than 1 hour or more than 75% of the RAM were terminated. The 
implementation and the datasets used for the experiments are available online⁴. 

Datasets: The datasets used are from real data and have been chosen to represent a 
variety of application domains. In Unix user⁵, each transaction is a series of shell 
commands executed by a user during one session. We report results on User 3; results are 
similar for the other users. JMLR is a natural language processing dataset; each 
transaction is an abstract of a paper from the Journal of Machine Learning Research. iPRG 
is a proteomics dataset from the application described in ID; each transaction is a 
sequence of peptides that is known to cleave in presence of a Trypsin enzyme. FIFA is 
a click stream dataset⁶ from logs of the website of the FIFA world cup in 98; each 
transaction is a sequence of webpages visited by a user during a single session. Detailed 
characteristics of the datasets are given in Table 1. Remark that the characteristics of 
these datasets are very diverse due to their different origins. 

¹ http://www.gecode.org 
² http://www.cs.rpi.edu/~zaki/www-new/pmwiki.php/Software/ 
³ https://code.google.com/p/prefixspan/ 
⁴ https://dtai.cs.kuleuven.be/CP4IM/cpsm 
⁵ https://archive.ics.uci.edu/ml/datasets/ 

dataset      $|\Sigma|$   $|D|$    $\|D\|$    $\max_{T \in D} |T|$   avg $|T|$   density 
Unix user    265          484      10935      1256                   22.59       0.085 
JMLR         3847         788      75646      231                    96.00       0.025 
iPRG         21           7573     98163      13                     12.96       0.617 
FIFA         20450        2990     741092     100                    36.239      0.012 

Table 1. Dataset characteristics. Respectively: dataset name, number of distinct symbols, number 
of transactions, total number of symbols in the dataset, maximum transaction length, average 
transaction length, and density calculated by $\frac{\|D\|}{|\Sigma| \times |D|}$. 
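
As a small sanity check on Table 1, the density column can be recomputed from the other columns. The sketch below assumes that density is the total number of symbols divided by the product of the alphabet size and the number of transactions; this reading is an assumption, chosen because it reproduces the four reported values. The names `density` and `TABLE_1` are hypothetical.

```python
def density(n_symbols, n_transactions, total_symbols):
    """Assumed definition: ||D|| / (|Sigma| * |D|)."""
    return total_symbols / (n_symbols * n_transactions)


# (|Sigma|, |D|, ||D||) as listed in Table 1.
TABLE_1 = {
    "Unix user": (265, 484, 10935),
    "JMLR": (3847, 788, 75646),
    "iPRG": (21, 7573, 98163),
    "FIFA": (20450, 2990, 741092),
}

computed = {name: round(density(*row), 3) for name, row in TABLE_1.items()}
```

A density close to 1 (as for iPRG) means that most symbols occur in most transactions, which typically leads to many more frequent patterns at a given threshold.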

In our experiments, we vary the minimum frequency threshold (minsup). Lower 
values for minsup result in larger solution sets, thus in larger execution times. 

Experiments: First we compare the global and the decomposed models. The execution 
times for these models are shown in Fig. 3, both without and with projected frequency 
(indicated by -p.f.). We first look at the impact of exposing the embedding variables in 
the decomposed model (Q1). Perhaps unsurprisingly, the global model is up to one 
order of magnitude faster than the decomposed model, which has $O(n \cdot k)$ extra variables. 
This is the overhead required to allow one to add constraints over the inclusion relation. 
We also study the impact of the projected frequency on both models (Q2). In the global 
model this is done as part of the search, while in the decomposed model this is achieved 
with an elaborate constraint formulation. For global-p.f. we always observe a speedup 
in Fig. 3. This is not the case for decomposed-p.f. on the two largest (in terms of $\|D\|$) datasets. 


[Figure 3: four panels (Unix user, JMLR, iPRG, FIFA); x-axis: Minsup (%).]

Fig. 3. Global model vs. decomposed model: Execution times. (Timeout 1 hour.)


We now evaluate the impact of user constraints on the number of results and on the
execution time (Q3). Fig. 4 shows the number of patterns and the execution times for
various combinations of constraints. We can see that adding constraints enables users to
control the explosion of the number of patterns, and that the execution times decrease
accordingly. The constraint propagation allows early pruning of invalid solutions, which
effectively compensates for the computation time of checking the constraints. For example,
on the Unix user dataset, it is not feasible to mine for patterns at 5% minimum frequency
without constraints, let alone do something with the millions of patterns found. On the
other hand, by adding constraints one can look for interesting patterns at low frequency
without being overwhelmed by the number of results (see also later).

http://www.philippe-fournier-viger.com/spmf/

Fig. 4. Number of patterns (top) and execution times (bottom) for the decomposed model with
various combinations of constraints.

The last experiment compares our models to existing algorithms. Fig. 5 shows
the execution times for our global model compared with regular-dfa, PrefixSpan and
cSpade (Q4). First, we can observe that regular-dfa is always slowest. On iPRG it performs
reasonably well, but the number of transitions in the DFAs does not permit it to
perform well on datasets with a large alphabet or large transactions, such as Unix user,
JMLR or FIFA. Furthermore, it cannot make use of projected frequencies.

global shows similar, but much faster, behaviour than regular-dfa. On datasets with 
many symbols such as JMLR and FIFA, we can see that not using projected frequency 
is a serious drawback; indeed, global-p.f. performs much better than global there. 

Of the specialised algorithms, cSpade performs better than PrefixSpan; it is the most
advanced algorithm and is the fastest in all experiments (not counting the highest frequency
thresholds). global-p.f. has taken inspiration from PrefixSpan and we can see
that they indeed behave similarly. However, for the dense iPRG dataset PrefixSpan performs
better than global-p.f., and inversely for the large and sparse FIFA dataset. This
might be due to implementation choices in the CP solver and PrefixSpan software.

Analysis of the pattern quality: Finally, we use our constraint-based framework to perform
exploratory analysis of the Unix user datasets. Table 2 shows different settings we
tried and patterns we found interesting. Few constraints lead to too many patterns while
more constrained settings lead to fewer and more interesting patterns.

Fig. 5. Global model vs. other approaches. Execution times. (Timeout 1 hour.)


7 Related work 

The idea of mining patterns in sequences dates from earlier work by Agrawal et al. [?],
shortly after their well-known work on frequent itemset mining [?]. The problem introduced
in [?] consisted of finding frequent sequences of itemsets, that is: sequences
of sets included in a database of sequences of sets. Mining sequences of individual
symbols was introduced later by [?]; the two problems are closely related and one can
adapt one to the other [?]. Sequence mining was driven by the application of market
basket analysis for customer data spread over multiple days. Other applications include
bio-medical ones where a large number of DNA and protein sequence datasets are available
(e.g. [?]) or natural language processing where sentences can be represented as
sequences of words (e.g. [?]).

Several specialised algorithms have addressed the problem of constrained sequence
mining. The cSpade algorithm [?], for example, is an extension of the Spade sequence
mining algorithm [20] that supports constraints of type 1, 2 and 3. PrefixSpan mentions
regular expression constraints too. The LCMseq algorithm [?] also supports a
range of constraints, but does not consider all embeddings during search. Other sequence
mining algorithms have often focussed on constraints of type 4, and on closed
sequence mining in particular. CloSpan [?] and Bide [?] are both extensions of PrefixSpan
to mine closed frequent sequences. We could do the same in our CP approach
by adding constraints after each solution found, following [?].

setting   # of patterns   interesting pattern      comment
$F_1$     627             -                        Too many patterns
$F_2$     512             -                        Long sequences of cd and ls
$F_3$     36              (latex, bibtex, latex)   User2 is using Latex to write a paper
$D_1$     7               (emacs)                  User2 uses Emacs, his/her collaborators use vi
$D_2$     9               (quota, rm, ls, quota)   User is out of disc quota

Table 2. Patterns with various settings (User 2): $F_1$: minfreq = 5%, $F_2$: $F_1 \land$ min-size = 3,
$F_3$: $F_2 \land$ max-gap = 2 $\land$ max-span = 5, $D_1$: minfreq = 5% $\land$ discriminant = 8 (w.r.t.
all other users), $D_2$: minfreq = 0.4% $\land$ discriminant = 8 $\land$ member(quota)

Different flavors of sequence mining have been studied in the context of a generic
framework, and constraint programming in particular. They all study constraints of type
1, 2 and 4. In [?] the setting of sequence patterns with explicit wildcards in a single sequence
is studied; such a pattern has a linear number of embeddings. As only a single
sequence is considered, frequency is defined as the number of embeddings in that sequence,
leading to a similar encoding to itemsets. This is extended in [?] to sequences
of itemsets (with explicit wildcards over a single sequence). [?] also studies patterns
with explicit wildcards, but in a database of sequences. Finally, [?] considers standard
sequences in a database, just like this paper; they also support constraints of type 3.
The main difference is in the use of a costly encoding of the inclusion relation using
non-deterministic automata and the inherent inability to use projected frequency.


8 Conclusion and discussion 

We have investigated a generic framework for sequence mining, based on constraint
programming. The difficulty, compared to itemsets and sequences with explicit wildcards,
is that the number of embeddings can be huge, while knowing that one embedding
exists is sufficient.

We proposed two models for the sequence mining problem. In the first, the
exists-embedding relation is captured in a global constraint. The benefit is that the complexity
of dealing with the existential check is hidden in the constraint. The downside is that
modifying the inclusion relation requires modifying the global constraint; it is hence
not generic towards such constraints. We were able to use the same projected frequency
technique as well-studied algorithms such as PrefixSpan by altering the global
exists-embedding constraint and using a specialised search strategy. Doing this does
amount to implementing specific propagators and search strategies into a CP solver,
making the problem formulation not applicable to other solvers out-of-the-box. On the
other hand, it allows for significant efficiency gains.

The second model exposes the actual embedding through variables, allowing for 
more constraints and making it as generic as can be. However, it has extra overhead and 
requires a custom two-phased search strategy. 

Our observations are not limited to sequence mining. Other pattern mining tasks
such as tree or graph mining also have multiple (and many) embeddings, hence they will
face the same issues with a reified exists relation. Whether a general framework
exists for all such pattern mining problems is an open question.
Appendix


A Branching with projected frequency 


We want to branch only over the symbols that are still frequent in the prefix-projected
sequences. Taking the current partially assigned sequence into account, after projecting
this prefix away from each transaction, some transactions will be empty and others will
have only some subset of their original symbols left.

For each propagator $C_i \leftrightarrow \exists e \text{ s.t. } S \sqsubseteq_e T_i$ we maintain the (monotonically
decreasing) set of symbols for that transaction in a variable $X_i$. The propagator in Algorithm 1
needs just a one-line addition, that is, after line 12 we add the following:


propagate by removing from $X_i$ all symbols not in $\{T_i[pos_e], \ldots, T_i[|T_i|]\}$, except $\epsilon$

which removes all symbols from $X_i$ that do not appear after the current prefix.

The brancher then first computes the local frequency of each symbol across all $X_i$,
and only branches on the frequent ones. Let $\theta$ be the minimum frequency threshold,
then the branching algorithm is the following:


Algorithm 2 local-frequency-brancher(S, X)

    pos_S ← position of first unassigned variable in S
    for s in D(S[pos_S]) do
        count ← 0
        for all X_i do
            if s ∈ D(X_i) then              ▷ symbol in domain of X_i
                count ← count + 1
            end if
        end for
        if count ≥ θ then
            add branch-choice 'S[pos_S] = s'
        end if
    end for
    branch over all branch-choices (if any)
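
The idea behind Algorithm 2 can be illustrated outside of a CP solver. The sketch below is not the solver-level brancher; it assumes a fully assigned prefix, plain Python lists as transactions, the leftmost embedding of the prefix as the projection point, and an absolute count for the threshold. The names `project` and `local_frequency_branches` are hypothetical.

```python
def project(transaction, prefix):
    """Return the part of `transaction` after the leftmost embedding of `prefix`,
    or None if the prefix is not included in the transaction."""
    pos = 0
    for symbol in prefix:
        try:
            pos = transaction.index(symbol, pos) + 1
        except ValueError:
            return None
    return transaction[pos:]


def local_frequency_branches(transactions, prefix, theta):
    """Symbols that occur after the prefix in at least `theta` transactions."""
    counts = {}
    for transaction in transactions:
        suffix = project(transaction, prefix)
        if suffix is None:
            continue  # transaction does not contain the prefix
        # set(suffix) plays the role of X_i: symbols still available in this transaction
        for symbol in set(suffix):
            counts[symbol] = counts.get(symbol, 0) + 1
    return sorted(s for s, c in counts.items() if c >= theta)


transactions = [list("ABCB"), list("ACB"), list("BAC"), list("CA")]
print(local_frequency_branches(transactions, ["A"], 2))
```

With the prefix `["A"]` and a threshold of 2, only `B` and `C` remain as branching candidates, so no branch is created for extending the pattern with another `A`.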


B Decomposition with explicit embedding variables, modeling details

The decomposition consists of two constraints: the position-match constraint and the 
is-embedding constraint. 


position-match formulation, part 1 The first constraint needed to enforce
position-match is formally defined as follows:

$$\forall i \in 1,\ldots,n,\ \forall j \in 1,\ldots,|T_i| : (S_j = T_i[E_{ij}]) \lor (E_{ij} = |T_i| + 1) \qquad (14)$$

Instead of modeling this with a reified element constraint, we can decompose the element
constraint over all values in $E_{ij}$ except the no-match value $|T_i| + 1$:

$$\forall i \in 1,\ldots,n,\ \forall j \in 1,\ldots,|T_i|,\ \forall x \in 1 \ldots |T_i| : E_{ij} = x \rightarrow S_j = T_i[x] \qquad (15)$$

Observe that in the above formulation $T_i[x]$ is a constant, so the reified $S_j = v$ expressions
can be shared for all unique values of $v \in \Sigma$.

Furthermore, using half-reified constraints we need only one auxiliary variable $B$ for
both $E_{ij} = x \rightarrow B$ and $B \rightarrow S_j = s$, where the latter can be shared for all unique
values of $s \in \Sigma$. This leads to $O(n \cdot k \cdot k)$ half-reified constraints of the former type
and $O(k \cdot k)$ auxiliary variables and half-reified constraints of the latter type, with
$k = \max_i(|T_i|)$.


position-match formulation, part 2 The second constraint needed to enforce
position-match is:

$$\forall i \in 1,\ldots,n,\ \forall j \in 2,\ldots,|T_i| : (E_{i(j-1)} < E_{ij}) \lor (E_{ij} = |T_i| + 1) \qquad (16)$$

Formulating this in CP would not perform any propagation until $|T_i| + 1$ is removed
from the domain of $E_{ij}$. However, one can see that the lower bound on $E_{i(j-1)}$, when
not equal to $|T_i| + 1$, can be propagated to the lower bound of $E_{ij}$.

Consider the following example: let $S = [\{B, C, \epsilon\}, \{A, B, C, \epsilon\}, \{A, B, C, \epsilon\}]$
and $T_i = [A, B, C]$; then $k = 3$ and $D(E_1) = \{2,3,4\}$, $D(E_2) = \{2,3,4\}$, $D(E_3) =
\{2,3,4\}$. However, because $\min(D(E_1)) = 2$ we know that $E_2 \neq 2$, and similarly for
$E_3$. This leads to $D(E_3) = \{4\}$, from which the is-embedding propagator can derive that
there is no embedding of the pattern in $E_3$. This is quite a common situation.

This propagation can be obtained with the following decomposition over all elements
of the domain (except $|T_i| + 1$):

$$\forall i \in 1 \ldots n,\ \forall j \in 1 \ldots |T_i| - 1,\ \forall x \in 1 \ldots |T_i| : (E_{i(j+1)} = x) \rightarrow (E_{ij} < x) \qquad (17)$$

However, this would require on the order of $O(n \cdot k^2)$ reified constraints and auxiliary
variables.

Instead, we use a simple modification of the binary inequality propagator $X < Y$
that achieves the same required result. This propagator always propagates the lower
bound of $X$ to $Y$, and as soon as $|T_i| + 1 \notin D(Y)$ it propagates like a standard $X < Y$
propagator.

There are $O(n \cdot k)$ such constraints needed and no auxiliary variables.
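
The behaviour of this modified propagator can be sketched on plain Python sets. The code below is an illustration under stated assumptions, not the solver implementation: domains are sets of integers, `no_match` stands for the value $|T_i| + 1$, and the names `propagate_pair` and `propagate_chain` are hypothetical.

```python
def propagate_pair(dom_x, dom_y, no_match):
    """One propagation step for (X < Y) or (Y == no_match) on set domains."""
    lower_x = min(dom_x)
    # Always push the lower bound of X onto Y, but never remove the no-match value.
    new_y = {v for v in dom_y if v > lower_x or v == no_match}
    new_x = set(dom_x)
    if new_y and no_match not in new_y:
        # Y must be a real position, so behave like a standard X < Y propagator.
        upper_y = max(new_y)
        new_x = {v for v in dom_x if v < upper_y}
    return new_x, new_y


def propagate_chain(domains, no_match):
    """Apply propagate_pair to consecutive variables until nothing changes."""
    doms = [set(d) for d in domains]
    changed = True
    while changed:
        changed = False
        for j in range(len(doms) - 1):
            if not doms[j] or not doms[j + 1]:
                return doms  # an empty domain signals failure
            new_x, new_y = propagate_pair(doms[j], doms[j + 1], no_match)
            if new_x != doms[j] or new_y != doms[j + 1]:
                doms[j], doms[j + 1] = new_x, new_y
                changed = True
    return doms


# Domains from the example above: D(E_1) = D(E_2) = D(E_3) = {2, 3, 4}, no-match value 4.
print(propagate_chain([{2, 3, 4}, {2, 3, 4}, {2, 3, 4}], no_match=4))
```

On the example domains, the value 2 is removed from the second domain and the third domain shrinks to the no-match value only, without any reified constraint or auxiliary variable.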

is-embedding formulation The constraint is the following:

$$\forall i \in 1,\ldots,n : C_i \leftrightarrow \forall j \in 1,\ldots,|T_i| : (S_j \neq \epsilon) \rightarrow (E_{ij} \neq |T_i| + 1) \qquad (18)$$

Across all transactions, the reified $S_j \neq \epsilon$ expressions can be shared. $O(n \cdot k)$ such
constraints and auxiliary variables are needed in total. For each transaction, the forall
requires $|T_i|$ times 2 auxiliary variables, one for reifying $E_{ij} \neq |T_i| + 1$ and one for
reifying the implication. This leads to an additional $O(n \cdot k)$ auxiliary variables and
constraints, plus $n$ reified conjunction constraints.
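
To make the semantics of the decomposition concrete, the following sketch checks a fully assigned pattern and embedding against the position-match and is-embedding conditions for a single transaction. It is a plain checker rather than a CP model; the encoding of $\epsilon$ as `None`, the 1-based positions, and the function names are assumptions made for illustration.

```python
EPS = None  # stands for the empty symbol used to pad the pattern


def position_match(pattern, transaction, embedding):
    """Check the two position-match conditions for one transaction."""
    no_match = len(transaction) + 1
    for j, pos in enumerate(embedding):
        if pos == no_match:
            continue  # the no-match value satisfies both disjunctions
        if pattern[j] != transaction[pos - 1]:
            return False  # matched position must carry the pattern symbol
        if j > 0 and not embedding[j - 1] < pos:
            return False  # matched positions must be strictly increasing
    return True


def is_embedding(pattern, transaction, embedding):
    """Value of C_i: every non-empty pattern symbol is mapped to a real position."""
    no_match = len(transaction) + 1
    return all(s is EPS or pos != no_match for s, pos in zip(pattern, embedding))


T = ["A", "B", "C"]
print(position_match(["A", "C", EPS], T, [1, 3, 4]), is_embedding(["A", "C", EPS], T, [1, 3, 4]))
print(position_match(["C", "A", EPS], T, [3, 4, 4]), is_embedding(["C", "A", EPS], T, [3, 4, 4]))
```

The second call shows the case where position-match holds but the pattern is not included: the symbol `A` cannot be placed after `C`, so its embedding variable takes the no-match value and the inclusion flag is false.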



C Sub-search for the existence of a valid Ei 


For each transaction $i$ independently, we can search for a valid assignment of the $E_i$
variables. As soon as a valid one is found, the sub-search can stop and propagate the
corresponding assignment to the $C_i$ and $E_i$ variables of the master problem.

The following pseudo-code describes how we implemented this scheme as a brancher 
in a copying solver (implementation for a trailing solver is similar): 


Algorithm 3 sub-search-brancher(C, E)

    substate ← copy of the current search state
    for all C_i do
        remove all other branchers (e.g. variable/value orderings)
        add to substate the variable/value ordering that tries C_i = true before C_i = false
        add to substate, as next variable/value ordering, a search over all E_ij variables
            in lexicographic order, trying their smallest value first
        solve substate
        if substate has a solution then
            save substate's assignment of C_i and E_i
        else
            fail the master problem         ▷ when no valid C_i, E_i can be found
        end if
    end for
    merge all saved assignments and have this as the only resulting branch-choice for the
        master problem


In the above algorithm, for each transaction $i$ we enter the loop and remove all
branchers, meaning that there are currently no search choices for the subproblem. We
then force the sub-search to only search over $C_i$ and $E_i$, such that an assignment for
$C_i$ = true is found first, if it exists. By removing all branchers at the start of the loop, the
next transaction's sub-search will not reconsider branching choices made in the previous
sub-search.

As the master problem should not branch over any of the sub-search choices either, 
we merge all the assignments found by the sub-searches and present this as the only 
branch-choice for the master problem. 

Using this sub-search-brancher, for each $T_i$ for which an embedding of $S$ in $T_i$
exists, $C_i$ will be true. Only if no such embedding exists will $C_i$ be false. This is the
required behaviour for our constraint formulation.
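
The effect of the sub-search can be mimicked with a small depth-first search per transaction. The sketch below is an illustration only: it assumes a fully assigned pattern without padding, uses 1-based positions, assigns the no-match value to every embedding variable when no embedding exists, and adds an optional `max_gap` parameter (a hypothetical stand-in for a constraint on the embedding) to show why backtracking is needed. It does not use a CP solver.

```python
def find_embedding(pattern, transaction, max_gap=None):
    """Depth-first search over the embedding positions, smallest value first.
    Returns the first valid list of 1-based positions, or None if there is none."""
    def search(j, start):
        if j == len(pattern):
            return []
        for pos in range(start, len(transaction) + 1):
            # gap = number of skipped symbols since the previous matched position
            if max_gap is not None and j > 0 and pos - start > max_gap:
                break
            if transaction[pos - 1] == pattern[j]:
                rest = search(j + 1, pos + 1)
                if rest is not None:
                    return [pos] + rest
        return None

    return search(0, 1)


def sub_search(pattern, transactions, max_gap=None):
    """Run one independent search per transaction and merge the results."""
    C, E = [], []
    for transaction in transactions:
        embedding = find_embedding(pattern, transaction, max_gap)
        C.append(embedding is not None)
        if embedding is None:
            embedding = [len(transaction) + 1] * len(pattern)  # all no-match
        E.append(embedding)
    return C, E


print(sub_search(["A", "C"], [["A", "B", "C"], ["C", "A"], ["A", "B", "A", "C"]]))
print(find_embedding(["A", "C"], ["A", "B", "A", "C"], max_gap=0))
```

With `max_gap=0` the first occurrence of `A` in the last transaction leads to a dead end, so the search backtracks to the second occurrence; this is the situation in which a greedy leftmost match would wrongly report that no embedding exists.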

D Projected frequency for explicit embedding variables 

We introduced the following constraint specification: 

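$$\forall j \in 1 \ldots n,\ \forall x \in \Sigma : S_j = x \rightarrow |\{i : C_i \land T_i[E_{ij}] = x\}| \geq \theta \qquad (19)$$

A naive formulation of this expression would require reifying an element constraint
$B \leftrightarrow T_i[E_{ij}] = x$. Instead, we will create element constraints $T_i[E_{ij}] = A_{ij}$, where $A_{ij}$
is an auxiliary integer variable. This leads to the following more efficient reformulation:

$$\forall i \in 1 \ldots n,\ \forall j \in 1 \ldots |T_i| : T_i[E_{ij}] = A_{ij} \qquad (20)$$

$$\forall i \in 1 \ldots n,\ j \in 1 \ldots |T_i|,\ x \in \Sigma : S_j = x \rightarrow |\{i : C_i \land A_{ij} = x\}| \geq \theta \qquad (21)$$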